perm filename VOTYPE[8,ALS] blob sn#041488 filedate 1973-05-09 generic text, type T, neo UTF8
00010	A.L.Samuel					May 8, 1973
00020	
00030			The Case for the Voice Typewriter
00040	
00050		While the present trend in the computer handling of
00060	speech is largely directed toward Speech Understanding rather
00070	than Speech Transcription, a case can be made for the proposition
00080	that the understanding of speech in the strict sense is a very
00090	much harder task than is the transcription of speech from its
00100	acoustic form to the conventional representation of this speech
00110	in ordinary written form. If speech understanding is harder than speech
00115	transcription then it seems that it might be well to reconsider the
00117	entire direction of current work. Without exception, all of the ARPA
00120	Speech Recognition groups are restricting themselves to very
00130	limited domains of discourse, so limited in fact that it is 
00140	extremely doubtful that their final systems will be of any use
00150	commercially, at least in the foreseeable future. The same objections
00160	can be made to the work of several industrial organizations,
00170	altho in those cases where I have any detailed information it seems
00180	to me that a more realistic approach is being followed and
00190	grandiose claims are not being made. A physically realizable
00195	speech transcription system might not be so severely limited in its
00197	scope and might be a more practical system for current exploitation.
00200		
00210		It is my belief that any industrial organization that
00220	wants to cash in on Speech Recognition should have two parallel
00230	programs, one along the conventional speech understanding approach
00240	to guarantee access to information that does come out of the
00250	ARPA work, and a second on speech transcription. The first of
00260	these programs should be conducted with a minimum of commercial
00270	secrecy, just enough to keep the outside world guessing as to
00280	what is being done but not so much as to erect barriers between
00290	the organization's workers and the outside or to lead others to
00300	suspect that there is anything else going on. The second program
00310	could then be run in complete secrecy so that the organization
00320	might reasonably lead the pack in the first commercial exploitation
00330	of a speech transcription system.
00340	
00350		Let me outline my present conception of what a commercial
00360	Speech Transcription system might be like. It is, of course, quite
00370	unrealistic to expect that one can construct a complete set of
00380	specifications for such a system at this time since there are still
00390	a great many unknowns with respect to the specific problems that
00400	will be encountered and the compromises that will have to be made,
00410	but here goes. In the first place, the first system to be introduced
00420	will undoubtedly require a rather large computer altho at the present
00430	rate of miniaturization and cost reduction this requirement can be
00440	expected to ease considerably. Assuming a rather expensive computer
00450	one would expect that the first users would be the larger companies
00460	that now have rather large typing pools. The system should therefore
00470	be one that could be phased in under these conditions and could make
00480	use of the displaced typists in some capacity so as to increase the
00490	total typing service capabilities of the overall operation. The
00500	better typists could be promoted to private secretarial jobs, the
00510	next level personnel would be retained to provide a post editing
00520	function and some saving in costs could be achieved by dispensing
00530	with the rest of the staff either by normal attrition or otherwise.
00540	
00550		Users of the system would dictate their letters to the
00560	system via telephone lines. No attempt to operate in real time would
00570	be made. Instead, a limited amount of processing only would be done
00580	at the time of generation, only enough to get the input into digital
00590	form with perhaps some condensation to save on storage space. There
00600	would be a personalized data bank for each regular user of the system
00610	which would be used in the subsequent processing. Incoming letters
00620	would be stacked in queues with any desired priority provisions and
00630	then processed from these queues. The processed letters could then
00640	be either typed out directly and delivered to the originator or they
00650	could be displayed to the originator on a scope for editing and
00660	correction. Users who merit the service could have their letters delivered
00670	to a secretary or a member of the typing pool for preliminary editing
00680	and correction before the letters would go to the originator. Again,
00690	this could be in the form of hard copy or of a scope display. The
00700	overall system would have to include a rather good editing program
00710	so that the correction and editing of the letter could be done with
00720	a minimum of effort on the part of the corrector.
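
The flow described above (digitize at dictation time, hold the letters in priority queues, process them later against a per-user data bank) can be sketched roughly as follows. This is only an illustrative sketch: the names DictationJob, TranscriptionQueue, submit and process_all, the numeric priority scale, and the caller-supplied transcribe function are all assumptions made for the example and are not part of the proposal itself.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class DictationJob:
    # Jobs compare on (priority, seq) only; lower priority number is served first.
    priority: int
    seq: int                                # arrival order, used to break ties
    user: str = field(compare=False)
    audio: bytes = field(compare=False)     # digitized (possibly condensed) speech


class TranscriptionQueue:
    """Hypothetical holding queue for dictated letters awaiting processing."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()
        self.profiles = {}                  # user -> personalized data bank (opaque here)

    def submit(self, user, audio, priority=5):
        """Store a dictated letter; no recognition is attempted at this point."""
        job = DictationJob(priority, next(self._counter), user, audio)
        heapq.heappush(self._heap, job)
        return job

    def process_all(self, transcribe):
        """Drain the queue in priority order.

        `transcribe(audio, profile)` is a stand-in for the actual
        transcription step, which this sketch does not implement.
        """
        results = []
        while self._heap:
            job = heapq.heappop(self._heap)
            profile = self.profiles.get(job.user, {})
            results.append((job.user, transcribe(job.audio, profile)))
        return results
```

The text produced by process_all would then go either directly to the originator or first to a post editor, as described above.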
00730	
00740		It might be well to discuss the minimum requirements as to
00750	accuracy that such a system would have to meet in order to be usable.
00760	The basic requirement would be that the initial output would have to be
00770	completely understandable. A second requirement would certainly be
00780	that the task of correcting the errors must not be more difficult
00790	and time consuming than the task of transcribing the letter by
00800	conventional methods. Perhaps a fairer comparison would be with the
00810	existing magnetic typewriters which provide correction facilities.
00820	Finally one would hope that there would be a sufficient margin between
00830	costs and the quality of output obtainable to make it possible to
00840	develop an adequate market.
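
The second requirement can be put as a simple inequality: the time spent reviewing and correcting the machine output must not exceed the time needed to type the letter conventionally. The sketch below is one possible way of writing this down. The linear cost model and every parameter name (error_rate, seconds_per_fix, review_seconds_per_word, words_per_minute) are assumptions introduced for illustration; no measured values are implied.

```python
def correction_time(n_words, error_rate, seconds_per_fix, review_seconds_per_word):
    """Assumed model: every word is read once, and a fraction of them is fixed."""
    return n_words * (review_seconds_per_word + error_rate * seconds_per_fix)


def typing_time(n_words, words_per_minute):
    """Time to transcribe the same letter by conventional typing."""
    return n_words * 60.0 / words_per_minute


def is_usable(n_words, error_rate, seconds_per_fix,
              review_seconds_per_word, words_per_minute):
    """True when correcting the output takes no longer than typing the letter."""
    return (correction_time(n_words, error_rate, seconds_per_fix,
                            review_seconds_per_word)
            <= typing_time(n_words, words_per_minute))


def break_even_error_rate(seconds_per_fix, review_seconds_per_word, words_per_minute):
    """Largest per-word error rate at which correction still matches typing."""
    margin = 60.0 / words_per_minute - review_seconds_per_word
    return max(0.0, margin / seconds_per_fix)
```

Under this model the break-even error rate does not depend on the length of the letter, only on how fast the corrector can read and fix relative to how fast a typist can type.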
00850	
00860		The next question is not so much whether it will be possible
00870	to develop such a transcription system, because there is little doubt
00880	but that it can and will be done, but rather how long it will take and
00890	what it will cost. No very precise answer can be given to these
00900	questions, except to say that the time will depend on the level of
00910	effort and the cost can be controlled by holding the level of effort
00920	to the minimum amount consistent with the objective of getting there
00930	first. The only practical answer would be to start now with a modest
00940	effort and to speed up the effort or slow it down as dictated by
00950	the rate of progress and the clues one might get as to the
00960	level of outside competition.
00970	
00980		Some general remarks may be in order as to how one might
00990	proceed to develop a voice typewriter and as to how this effort
01000	would depart from work on Speech Understanding. In the first place,
01010	the entire approach should be based upon the initial identification
01020	of objects smaller than words or phrases, since the number of
01030	different words that would have to be stored and referenced for a 
01040	strictly word recognition system to work would be entirely too large.
01050	Word recognition would be used of course to perform the final 
01060	transcription from some phonetic representation into the final written
01070	form since the spelling of English words is illogical and inconsistent,
01080	but it should still be possible to render a reasonable transcription of
01090	words that do not appear in the available dictionary. There is some
01100	question as to whether the initial identification should be in terms
01110	of syllables, phonemes or even smaller units. The identification of
01120	these units would be an individualized matter since speakers differ
01130	in their rendition of phonetic events. The Signature Table approach
01140	as currently under study in the A.I. project at Stanford University
01150	seems to offer a very convenient method of acquiring the individualized
01160	data that is needed. Once this initial identification has been made the
01170	subsequent processing is reasonably independent of speaker idiosyncrasies.
01180	From here on the work that is currently being done by the various ARPA
01190	groups can be called upon for ideas as to how to proceed. However in
01290	the case of speech transcription use would be made of phonological
01390	rules, dynamic pattern matching, linguistic constraints and meaning
01490	to resolve ambiguities in the classification of the basic units, 
01590	and not to establish meaning per se. By way of contrast, the emphasis
01690	of the speech understanding work is on the establishment of meaning
01790	without any serious attempt to identify the written equivalent of what 
01890	was actually said.
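
The final step described above, going from a phonetic representation to written words by dictionary lookup while still rendering a reasonable spelling for words not in the dictionary, might look something like the following. This is a minimal sketch only. The phoneme symbols, the two dictionary entries and the one-to-one fallback spelling table are all hypothetical, and the earlier stages (identification of the basic units and resolution of ambiguities) are assumed to have been done already.

```python
# Hypothetical pronouncing dictionary: phoneme sequence -> conventional spelling.
DICTIONARY = {
    ("K", "AE", "T"): "cat",
    ("TH", "R", "UW"): "through",
}

# Hypothetical fallback: one plausible letter group per phoneme.
FALLBACK_SPELLING = {
    "K": "k", "AE": "a", "T": "t", "TH": "th", "R": "r",
    "UW": "oo", "Z": "z", "IH": "i", "P": "p",
}


def spell_word(phonemes):
    """Return the dictionary spelling, or a rough phonetic spelling if unknown."""
    key = tuple(phonemes)
    if key in DICTIONARY:
        return DICTIONARY[key]
    # Word not in the dictionary: spell it unit by unit, marking unknown units.
    return "".join(FALLBACK_SPELLING.get(p, "?") for p in phonemes)


def transcribe(words):
    """Convert a list of words (each a list of phonemes) to written text."""
    return " ".join(spell_word(w) for w in words)
```

For example, transcribe([["K", "AE", "T"], ["Z", "IH", "P"]]) finds the first word in the dictionary and falls back to a unit-by-unit spelling for the second. A post editor would be expected to correct such fallback spellings.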
01990	
02090		Finally one must draw a clear distinction between the approach
02190	that should be taken to acquire the information that will be needed
02290	to design a speech transcription system and the actual design of the
02390	system. Clearly there is a real advantage in terms of flexibility
02490	in doing everything by programming during the study phase while a
02590	real gain in speed for the final operating system can be obtained
02690	by designing special purpose hardware for the execution of highly
02790	repetitive portions of the analysis. This distinction has all too
02890	often been obscured by workers in the field who start their study
02990	by making restrictive choices as to hardware before information is
03090	available to permit an intelligent choice.